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1  Introduction 

This  paper  considers  the  problem  of  estimating  a  vector  of  parameter  (3  6  P  from  the  linear 
problem 

y  =  X/3  +  z,  (1) 

where  y  £  Mn  is  a  vector  of  observations,  X  an  nxp  predictor  matrix,  and  z  a  vector  of  independent 
normal  random  variables.  The  goal  is  to  find  a  relevant  parametric  vector  j3*  £  W  among  many 
potential  candidates  and  obtain  high  prediction  accuracy. 

The  t\  penalized  least  squares  estimator  for  problem  (1)  has  been  the  focus  of  a  great  deal 
of  attention  for  variable  selection  and  estimation  in  high-dimensional  linear  regression  when  the 
number  of  variables  is  much  larger  than  the  sample  size  [10,  20,  23,  25,  26,  29].  Recently  the 
Dantzig  selector  was  proposed  for  problem  (1)  in  [6].  The  Dantzig  selector  j3  £  MP  is  a  solution  to 
the  optimization  problem 


/3  £  argmin{ ||/3 ||i  :  \\D  1XT (X/3  -  y)^  <  6},  (2) 

with  a  fixed  parameter  5  >  0  and  a  diagonal  matrix  D  where  the  diagonal  entries  are  equal  to 
the  norm  of  the  columns  of  X.  Here,  we  write  ||x||g  for  the  lq  norm  of  x  £  Rp,  1  <  g  <  oo. 
Optimal  £2  rate  properties  for  \\/3  —  /5* 1 1 2  were  established  under  a  sparsity  scenario  and  impressive 
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empirical  performance  on  real  world  problems  involving  large  values  of  p  was  shown  in  [6] .  Since 
then  the  Dantzig  selector  has  received  a  considerable  amount  of  attention.  Discussions  on  the 
Dantzig  selector  can  be  found  in  [3,  5,  7,  11,  13,  21,  24],  In  [15],  an  algorithm  was  proposed  for 
fitting  the  entire  coefficient  path  of  the  Dantzig  selector  with  a  similar  computational  cost  to  the 
least  angle  algorithm  that  is  used  to  compute  the  t\  minimization  via  the  LASSO  technique.  The 
Dantzig  selector  is  a  convex,  but  not  strictly  convex,  optimization  problem.  Unique  solutions  are  in 
general  not  guaranteed.  Conditions  ensuring  the  uniqueness  of  the  Dantzig  selector  were  presented 
in  [9].  In  [17]  a  new  class  of  Dantzig  selectors  for  linear  regression  problems  for  right-censored 
outcomes  was  proposed. 

The  importance  of  the  Dantzig  selector  in  linear  regressions  has  been  demonstrated  in  the  afore¬ 
mentioned  work.  Efficient  methods  for  solving  problem  (2),  which  however  were  not  emphasized 
in  the  current  literature,  are  highly  needed.  In  [6],  the  problem  is  cast  as  a  linear  program  which 
is  solved  by  using  a  primal-dual  interior  point  algorithm  [4].  As  it  is  well  known,  interior  point 
methods  are  not  efficient  for  large-scale  problems.  In  [2],  the  problem  is  cast  as  linear  cone  pro¬ 
gramming  problem  for  which  a  smooth  approximation  to  its  dual  problem  is  solved  by  an  optimal 
first-order  method  [1,  22].  Recently,  an  alternating  direction  method  (ADM)  for  finding  the  Dantzig 
selector  was  studied  in  [18].  Numerical  experiments  showed  that  this  method  usually  outperforms 
the  method  in  [2]  in  terms  of  CPU  time  while  producing  solutions  of  comparable  quality.  The 
problem  was  rewritten  in  [18]  in  a  form  to  which  ADM  can  be  easily  applied.  ADM  itself  is  an 
iterative  algorithm.  In  each  iterate,  two  subproblems  are  needed  to  be  solved  successively.  One  of 
the  subproblems  has  a  closed  form  solution,  while  the  other  does  not  and  is  approximated  by  a 
nonmonotone  gradient  method  proposed  in  [19].  To  alleviate  the  difficulty  caused  by  the  subprob¬ 
lem  without  a  closed  form  solution,  a  linearized  ADM  was  proposed  for  the  Dantzig  selector  and 
was  shown  to  be  efficient  for  solving  both  synthetic  and  real  world  data  sets  in  [28]. 

In  this  paper,  the  Dantzig  selectors  for  problem  (2)  are  found  by  an  algorithm  based  upon 
proximity  operators.  We  first  rewrite  the  problem  as  an  unconstrained  structural  optimization 
problem  via  an  indicator  function.  The  resulting  problem  is  then  solved  by  a  primal-dual  algorithm. 
In  comparison  with  the  one  given  in  [18],  our  proposed  algorithm  is  easy  to  implement.  Ours 
achieves  comparable  quality  results  while  consuming  much  less  CPU  time. 

The  outline  of  the  paper  is  organized  as  follows.  In  Section  2  we  present  our  fixed-point  theory 
based  proximity  operator  algorithm  for  solving  problem  (2).  In  Section  3,  we  present  numerical 
experiments  comparing  the  accuracy  and  efficiency  of  the  proposed  algorithm  with  ADM  proposed 
in  [18].  The  first  set  of  experiments  uses  simulated  sparse  signals  and  the  second  set  uses  samples 
of  biomarker  data  to  predict  the  diagnosis  of  leukemia  patients.  Section  4  concludes  the  paper. 

The  following  notation  will  be  used  in  the  rest  of  the  paper.  For  any  vector  u  £  Rrf,  let  ut  and 
u(i)  both  denote  the  i-tb  component  of  u.  Also  for  any  vector  u  £  Rd,  |u|  is  the  component- wise 
absolute  values  of  u,  that  is  the  i-tli  component  of  |u|  is  \m |,  while  sign(u)  is  the  vector  whose  i-th 
component  is  1  if  zq  >  0  and  —1  otherwise.  Given  two  vectors  u  and  v  in  Rd,  x  o  y  denotes  the 
Hadamard  (component-wise)  product  of  u  and  v,  max{it,  v}  denotes  the  vector  whose  i-th  entry 
is  ma x.{ui,Vi},  and  minjit,  u}  denotes  the  vector  whose  i-th  entry  is  min {ui,Vi}.  Let  1  denote  the 
vector  of  all  ones  whose  dimension  should  be  clear  from  the  context. 

The  natural  numbers  are  given  by  N.  For  the  usual  d-dimensional  Euclidean  space  denoted  by 

we  define  (x,  y)  :=  Yli=ixiyii  f°r  xiV  ^  the  standard  inner  product  in  Mrf.  We  denote  by 
||  •  ||i,  ||  •  || 2,  and  ||  •  | |oo  the  l\  norm,  £2  norm,  and  the  norm  of  a  vector,  respectively.  The  class 
of  all  lower  semicontinuous  convex  functions  /  :  Rd  — >  (—00, +00]  such  that  dom/  :=  {x  £  : 

f(x)  <  +00}  /  0  is  denoted  by  ro(M‘i).  For  a  closed  convex  set  C  of  Md,  its  indicator  function  tc 
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is  in  r r 


and  is  defined  as 


iC{u)  :  = 


0,  if  u  g  C, 
+oo,  otherwise. 


For  a  function  /  G  Fo(Md),  argrninx g c  f  (x)  is  the  set  of  points  of  the  given  argument  in  C  for  which 
/  attains  its  minimum  value,  i.e.,  argmin xecfix)  =  {x  G  C  :  f(y)  >  f(x)  for  all  y  G  C}. 

2  The  Dantzig  Selector  with  Proximity  Algorithms 

In  this  section,  we  develop  a  proximity  algorithm  for  solving  the  optimization  problem  (2).  We 
begin  with  reviewing  two  existing  works  on  this  problem,  namely  the  alternating  direction  method 
(ADM)  proposed  in  [18]  and  the  linearized  alternating  direction  method  of  multipliers  (LADM) 
proposed  in  [28].  Both  methods  work  on  the  reformulated  optimization  problem  (2)  with  D  =  I  as 
follows: 

a  „  Ti!  «{II0II1 :  xT(xP  -y)  =  (3) 

/3sRp,tG{t:||t||oo<5} 

where  r  G  MP  is  an  auxiliary  variable.  The  augmented  Lagrangian  function  for  problem  (3)  is 
Lc{P,t,  7)  :=  ||/3||i  +  (7,  Xt(X/3  -  y)  -  r)  +  U\XT (X ft  -  y)  -  t\\%, 


where  7  G  is  the  Lagrange  multiplier  and  c  >  0  is  a  penalty  parameter. 
The  iterative  scheme  of  ADM  for  optimization  problem  (3)  is 

(  Tk+1  <-  argminTe{T.||r||oo<(5}Lc(/3fe,r,7fe), 

<  f3k+l  <—  argmin 0eRPLc(/3,  rfc+1,  yfc), 

(  rjk+1  =1k  +  c(xT(X/3k+1  -y)-  Tk+1), 

which,  with  some  elementary  manipulations,  can  equivalently  be  written  as 


argminTe{T:||T||oo<5}||r  -  ( XT(X/3k  -  y)  +  \) 


argminggMp 


[  7fc+1  =  7fc  +  c(XT (X f3k+1  -y)-  rfc+1). 

The  r-related  subproblem  in  (4)  has  a  closed  form  solution,  but  the  /3-related  subproblem  does  not 
and  is  solved  approximately  by  using  the  nonmonotone  gradient  method  in  [18]. 

The  iterative  scheme  of  LADM  for  optimization  problem  (3)  is 

'  /3fc+1  4-  argmin/36MP{||/3||1  +  c(vk^  -  (3k)  +  |||/3  -  ^|||}, 

<  Tk+1  G-  argminr6rT.||T||oo<(5} ||r  -  ( XT(X/3k+1  -  y)  +  ^)|||,  (5) 

7fc+ 1  =  7fc  +  c{X^(X/3k+1  -y)-  rfc+1), 

where  £  >  0  is  a  proximal  parameter  and  vk  :=  XT X(XT  (X(3k  —  y)  —  Tk  +  T-).  Note  that  the  order 
of  updating  rk+l  and  (3k+1  in  ADM  is  reversed  in  LADM.  For  the  /3-subproblem  in  LADM,  the 
last  two  terms  in  the  objective  function  can  be  viewed  as  the  linearization  of  the  quadratic  term 
§||AT(A/3-y)  —  rk  +  [| I  with  respect  to  (3  at  (3k  after  dropping  a  constant.  Furthermore,  the 

/3-subproblem,  after  completing  the  square  of  these  two  terms  and  ignoring  the  resulting  constant 
term,  is  the  same  as 


argmm/36lRp 


i  +  ^-(/s*-p)ll!. 
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which  has  a  closed  form  solution.  The  r-subproblem  has  a  closed  form  solution  as  in  ADM. 
Therefore,  LADM  can  be  easily  and  efficiently  implemented.  It  was  shown  in  [28]  that  for  any 
c  >  0  and  i  >  2||XTA||2  and  any  initial  iterate  (/3°,r°, 70),  the  sequence  {(/3fc,  rfc,  yfe)  :  k  G  N} 
converges.  Furthermore,  the  limit  of  the  sequence  {((3k,rk)  :  k  G  N}  is  a  solution  of  the  Dantzig 
selector  problem  (3). 

In  the  following,  we  present  our  fixed-point  theory  based  proximity  operator  algorithm  for 
solving  the  optimization  problem  (2).  For  simplicity  of  exposition,  with  the  matrices  X  and  D,  the 
vector  y,  and  the  constant  5  appearing  in  problem  (2),  we  set 

A  :=  D~1XtX,  b  :=  D~1XTy ,  C  :=  {/3  G  :  \\/3  -  <  5}.  (6) 

Then  the  optimization  problem  (2)  can  be  rewritten  as 

P  G  argmin{ H/3 ||i  +  ic(A/3)  :  /3  G  Mp}.  (7) 

The  objective  function  of  this  problem  is  convex  and  coercive  thanks  to  the  t \  -norm  being  coercive. 
Hence  a  solution  to  problem  (7)  exists  and  can  be  characterized  in  terms  of  proximity  operator.  To 
this  end,  we  review  the  definition  of  proximity  operator. 

For  a  function  /  G  ro(Md),  the  proximity  operator  of  /  with  parameter  A,  denoted  by  prox^, 
is  a  mapping  from  M.d  to  itself,  defined  for  a  given  point  x  G  by 

proxAy(x)  :=  argmin  |^lln  —  ^|||  +  f(u)  '■  u  £Wd 

Now,  we  can  present  a  characterization  of  solutions  of  problem  (7)  that  is  simply  derived  from 
Fermat’s  rule. 

Theorem  2.1  Let  the  pxp  matrix  A  and  the  vector  b  G  Rp  be  given  in  (6).  If  [3  G  Rp  is  a  solution 
to  problem  (7),  then  for  any  a  >  0  and  A  >  0  there  exists  a  vector  rGP  such  that 

P  ~  ~Att)  ,  (8) 

a  J 

T  =  (I  —  proxtc )  (A/3  +  r) .  (9) 

Conversely,  if  there  exist  a  >  0  and  A  >  0  such  that  /3,r  G  MP  satisfy  equations  (8)  and  (9),  then 
fl  is  a  solution  of  problem  (7). 

Proof:  The  proof  of  the  result  follows  straightforwardly  a  general  result  in  [16,  Proposition  1],  For 
completeness,  we  present  its  proof  here.  First,  we  assume  that  /3  is  a  solution  to  problem  (7).  By 
Fermat’s  rule  and  the  chain  rule  of  subdifferentiation,  0  G  <9||  •  ||i(/3)  +  AT dtc(Af3).  Then  for  any 
a  >  0  and  A  >  0  there  exists  r  G  jdic{A/3)  such  that  —^Att  G  6  (^||  •  ||i)  (/3),  that  is,  in  terms 
of  proximity  operator,  equation  (8).  Since  the  set  dic(A/3)  is  a  cone,  then  r  G  jdtc{Af3)  implies 
r  G  die  (A/3)  which  is  essentially  equivalent  to  equation  (9). 

Conversely,  if  equations  (8)  and  (9)  are  satisfied,  we  then  have  —^Att  G  <9(^||  •  ||i)  (/3)  and 
r  G  dic(A/3)  accordingly.  Using  the  fact  that  the  set  die  (A/3)  is  a  cone  again,  the  second  inclusion 
r  G  die  (A/3)  implies  G  ^dic(Afj).  Multiplying  AT  to  both  sides  of  the  previous  inclusion 
and  using  the  chain  rule  d(ic  o  A)(/3)  =  AT dic(A(3),  we  have  that  ^ Att  G  ^ d(ic  o  A)(/3).  Since 
-^Tr  €  d(l)(/3),  we  obtain  0  G  <9||  •  ||i(/3)  +  d(ic  o  A)((3).  This  shows  that  (3  is  a  solution  to 
problem  (7).  □ 


4 


We  comment  on  the  computation  of  the  proximity  operators  proxin  m  and  prox  appearing 

CK  II  II  1  ^ 

in  equations  (8)  and  (9).  The  proximity  operator  proxjn. m  at  any  u  €  W  is  the  well-known 

cc  II  II  1 

soft-thresholding  operator  given  as  follow: 

(10) 

Lemma  2.2  Let  6  be  a  constant,  let  b  be  a  vector  in  Mp,  and  let  the  set  C  be  given  in  (6).  Then 
for  any  vector  v  G  , 

prox((, (v)  =  b  +  min{max{u  —  b ,  —  dl},  51 } .  (11) 

and 

(I  -  proxtc)(u)  =  prox(j||.||1(u  -  b). 

Proof:  It  is  well-known  that  the  proximity  operator  proxt(,  is  the  projection  operator  onto  the 
set  C.  Since  the  set  C  is  the  cube  with  b  as  its  center  and  26  as  the  length  of  its  side  in  Mp. 
Hence,  proxtc(u)  the  projection  of  the  vector  v  G  is  given  by  (11).  Further,  it  holds  that 
(/  —  prox,c )(v)  =  (v  —  b)  —  min{max{(u  —  b ),  — 51},  51 } .  From  this  identity,  we  can  directly  check 
that  for  each  i  from  1  to  n 


proxj_M  ||  ( u )  =  sign(u)  o  max  < 

“  '  1  l 


it  I - 1, 0 

a 


(v  —  b)i  —  min{max{ (v  —  b)i,  —5},  5}  =  sign((u  —  b)i)  •  max{|(u  —  6)*|  —  6 ,  0}, 
which,  by  using  equation  (10),  is  prox(5||.||1((u  —  b)i).  This  completes  the  proof.  □ 

As  a  result  of  Lemma  2.2,  equation  (9)  can  be  rewritten  as  follows: 


t  =  prox5||. ip  (A/3  +  t  —  b).  (12) 

Therefore,  by  Theorem  2.1,  finding  a  solution  fd  to  problem  (7)  amounts  to  solving  the  coupled 
fixed-point  equations  (8)  and  (12). 

Two  iterative  schemes  can  be  derived  from  equations  (8)  and  (12).  Let  us  write  equation  (8)  as 
j3  =  prox  in  .11  (/3  —  ^At(2t  —  r)).  With  any  initial  estimates  r-1  =  r°  and  /3°,  the  first  iterative 
scheme  based  upon  equations  (8)  and  (12)  is  as  follows: 

f  Pk+1  =  proxi|M|i(/3fc  -  ±AT(2rk  -  r^1)), 

{  rk+1  =  prox^ij.n^A/S^"1"1  +  rk  —  b). 

We  would  like  to  comment  the  connection  of  this  scheme  with  some  existing  ones.  The  dual 
formulation  of  (7),  as  derived  in  [18],  is 

max{-(6,r)  -  5||r||i  :  ||ATr||00  <  1}. 


Applying  the  primal-dual  hybrid  gradient  method  (see  [12,  Equation  2.18])  to  the  above  dual 
formulation  yields  exactly  the  iterative  scheme  (13).  It  was  further  pointed  out  in  [8]  that  the 
iterative  scheme  (13)  is  essentially  the  same  as  the  linearized  ADM  applying  to  problem  (7).  In 
other  words,  the  iterative  scheme  (13)  is  the  same  as  (5)  in  the  case  of  D  =  I. 

Now,  let  us  introduce  the  second  iterative  scheme  for  problem  (7).  Let  us  write  equation  (12)  as 
t  =  prox<5||.M1(A(2/3  —  j3)  +  r  —  b).  With  any  initial  estimates  =  /3°  and  r°,  the  second  iterative 
scheme  based  upon  equations  (8)  and  (12)  is  as  follows: 


rk+ 1  =  prox5|M!l(A(2/Ifc  -  Pk~l)  +  rk  -  b ), 
/3k+1  =  proxjLM.il  (/ 3k  -  ±ATTk+l). 

a  II  HI 


(14) 
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The  sequence  {(/3 k,rk)  :  k  G  N}  generated  by  the  iterative  schemes  (13)  and  (14)  will  converge 
for  any  initial  seeds  when  A /a  <  1/||A|||.  The  proof  of  this  convergence  result  can  be  found  in 
[8,  16].  Hence,  the  limit  of  the  sequence  {(/3fc,rfc)  :  k  G  N}  is  a  fixed-point  of  equations  (8)  and  (9). 
In  particular,  the  limit  of  the  sequence  {/3k  :  k  €  N}  is  a  solution  to  problem  (7). 

As  noted  in  [6] ,  the  Dantzig  selector  often  slightly  underestimates  the  true  values  of  the  nonzero 
parameters.  To  correct  this  bias  and  increase  performance  in  practical  settings,  a  postprocessing 
procedure  was  proposed  in  [6].  Assume  that  f3°°  is  the  limit  of  the  sequence  {/ 3k  :  k  £  N}  that 
is  generated  through  the  iterative  scheme  (14).  This  postprocessing  consists  of  two  steps.  The 
first  step  is  to  estimate  A  :=  {i  :  j3?°  /  0},  the  support  of  the  vector  f3°°.  Let  X\  be  the  n  x  |A| 
submatrix  obtained  by  extracting  the  columns  of  X  corresponding  to  the  indices  in  A,  and  let  /3A 
be  the  |  A  | -dimensional  vector  obtained  by  extracting  the  coordinates  of  f3  £  Rp  corresponding  to 
the  indices  in  A.  The  second  step  of  the  postprocessing  is  to  construct  the  estimator  f3  £  RP  such 
that 

Pa  =  argmin{||ATA/3  -  y\\2  :  ft  £  Rp} 

and  set  the  other  coordinates  to  zero.  If  the  matrix  X~^X\  is  invertible  then  /3A  =  (Xj^X\)~1X~[y. 

Putting  all  above  discussion  together,  a  complete  two-stage  procedure  for  finding  a  solution  of 
problem  (7)  is  described  in  Algorithm  1. 


Algorithm  1  (Two-stage  scheme  for  problem  (7)) 

Input:  Set  the  fixed  parameters 

y  e  Mn,  A  e  Rpxp,  b  e  Rp, 

Initialization:  Set  the  initial  parameters 

r°  =  o,  r1 

Stage-I:  Generate  the  sequence  {(rfc,/3fc)  : 
while  (stopping  criterion  not  met)  do 

rk+ 1  <-  prox5|M|1(A(2/3fc  -  pk~l)  +  rk  -  6), 

Pk+i  proXi  _  ±ATrk+1), 
k^k  +  1. 

end  while 

Stage-II:  Let  (r°° ,  P°°)  be  the  last  set  of  parameters  computed  in  Stage-I. 

•  Approximate  supp(/3°°)  by  A  =  {j  :  |/3°°(j)|  <  tol}. 

•  Compute  v  =  argrnin„eM|A|  {||Aau  -  y ||2}. 

•  Extend  v  to  form  the  Dantzig  selector  ft  on  A: 

f3(A(i))  =  v(i),  for  7  =  1:  |A|, 
jMj)  =  0,  for  j  £  A. 


6,  a ,  tol  £  M+,  and  A  =  0.999a/|| A\\\. 

=  P°  =  0,  and  k  =  0. 
k  £  N}  using  Equations  (10)  and  (14). 


Stage-I  of  Algorithm  1  terminates  once  the  sequence  {(rk,Pk)  :  k  £  N}  reaches  a  stationary 
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point.  To  estimate  when  this  occurs,  terminate  the  iterations  when  either  of  the  following  stopping 
criteria  are  met: 


1.  The  relative  change  between  successive  terms  in  the  sequence  {/ 3k }  falls  below  a  specified 
tolerance; 

\\/3k+1  - /3k\L 

— - —  <  r 

ll  nh  ll  ^  c’ 


2. 


for  some  e  >  0,  or 

The  support  of  the  sequence  {/3k}  is  stationary  for  a  specified  number  of  successive  iterations; 

supp  {(3k)  =  supp(/3fc+1)  =  •  •  •  =  supp(/3fc+r?), 


for  a  fixed  r)  €  N  and  some  positive  integer  k. 

Stage-I  is  the  largest  contributor  to  the  computational  complexity  of  Algorithm  1,  with  each 
iteration  having  complexity  0{Anp).  In  comparison,  each  outer  loop  of  ADM  computing  the  r 
and  7- related  subproblems  has  complexity  O(Anp),  while  each  inner  loop  of  ADM  approximating 
the  /3-related  subproblem  has  complexity  0(8np).  In  general,  Algorithm  1  and  ADM  will  use 
a  different  number  of  iterations  to  terminate  their  iterative  stages,  so  their  overall  complexities 
cannot  be  directly  compared.  However,  the  numerical  experiments  in  the  next  section  indicate 
that  Algorithm  1  tends  to  have  less  overall  complexity  than  ADM  since  Algorithm  1  has  a  shorter 
runtime  even  in  situations  where  it  requires  more  iterations. 


3  Numerical  Experiments 

In  the  following  experiments,  we  apply  the  proposed  proximity  operator  based  approach  presented 
in  Algorithm  1  and  the  alternating  direction  method  (ADM)  presented  in  [18]  to  solve  the  Dantzig 
selector  problem  (2)  using  both  synthetic  and  real  data  sets.  The  experiments  using  synthetic 
data  are  performed  in  MATLAB  R2013a  on  single  nodes  of  the  Condor  Supercomputer,  hosted  at 
AFRL/RIT  Affiliated  Resource  Center.  The  full  capabilities  of  Condor  were  not  taken  advantage 
of;  we  ran  the  algorithms  in  serial  using  single  nodes  to  emulate  a  typical  high  end  consumer 
workstation.  Each  utilized  node  is  equipped  with  an  Intel  Xeon  X5650  6  core  CPU,  with  2.67  GHz 
and  6x8  GB  RAM.  The  experiments  using  the  real  data  set  are  performed  in  MATLAB  R2014a 
on  a  PC  with  an  Intel  Core  i7-3630QM  2.40  GHz  processor  and  16  GB  RAM  running  Windows  7 
Enterprise. 

Example  3.1  Synthetic  Data  Set 

In  this  series  of  simulations,  sparse  coefficient  vectors  are  generated  then  recovered  from  noisy 
random  linear  observations  using  both  Algorithm  1  and  ADM.  The  parameters  used  are  n  = 
720 m,  p  =  2560m  and  s  =  80m  for  m  €E  {2,  3, . . . ,  10},  and  a  €  {0.01,  0.05,  0.10,  0.15}  corre¬ 
sponding  to  1%,  5%,  10%  and  15%  noise  levels.  For  each  combination  of  m  and  a,  100  simulations 
each  of  Algorithm  1  and  ADM  are  performed.  All  other  parameters  for  ADM  are  selected  following 
the  guidelines  in  [18,  Section  3]  and  the  parameters  selected  for  the  initialization  stage  of  Algo¬ 
rithm  1  are  t.ol  =  2a,  a  =  0.2||A||2  and  6  =  a\j2  logp.  The  parameters  for  the  stopping  criteria 
are  e  =  10-4  and  rj  =  max{[41og(a)  log(<j)  +  2 a]  ,  5}.  The  parameters  tol,5  and  77  depend  on 
the  noise  level  a,  which  in  practice  may  not  be  known  a  priori.  However,  the  noise  level  may 
be  well-approximated  using  existing  methods.  In  the  event  that  the  noise  level  is  not  accurately 
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approximated,  the  speed  of  convergence  of  Algorithm  1  will  be  affected,  but  the  accuracy  should 
not  suffer  much.  The  stopping  criteria 

The  n  x  p  sensing  matrices  X  are  generated  for  each  simulation  with  independent  Gaussian 
entries  normalized  so  each  column  has  unit  t 2  norm.  To  generate  the  coefficient  vector,  for  each 
simulation  a  support  set  S  of  size  |5|  =  s  is  selected  uniformly  at  random.  Then  the  vector  (3  with 
indices  in  S  is  defined  according  to  Ps(i)  =  e*(l  +  la*l)i  where  {aj}  is  a  collection  of  independently 
and  identically  distributed  random  variables  sampled  from  the  standard  normal  distribution  and 
{ej}  is  a  collection  of  independently  and  identically  distributed  random  variables  sampled  from  the 
uniform  distribution  on  {—1, 1}.  For  i  ^  S,  set  /3*  =  0.  Then  Algorithm  1  and  ADM  are  used 
to  approximate  the  Dantzig  selector  f3  from  the  observations  y  =  X/3  +  z,  where  z  is  a  collection 
of  independent  and  identically  distributed  random  variables  sampled  from  the  normal  distribution 
with  mean  zero  and  standard  deviation  a. 

Algorithm  1  without  Postprocessing  Algorithm  1  with  Postprocessing 


Figure  3.1:  A  demonstration  of  the  accuracy  of  the  Dantzig  selector  recovered  using  Algorithm  1  and 
ADM  with  and  without  postprocessing  for  a  single  simulation  of  Experiment  3.1  with  parameters 
a  =  0.05  with  (n,p,s)  =  (720,2560,80). 


the  accuracy 


P  := 


w-M 

Y7j= 1  min {/3],  cr2} 


1/2 


(15) 


where  /?  denotes  the  true  parameter  and  j3  denotes  the  parameter  recovered  using  either  Algorithm  1 
or  ADM.  The  denominator  term  of  Equation  (15)  is  the  expected  mean  squared-error  of  the  ideal 


estimator  [6].  Therefore,  p  >  0,  and  a  smaller  p  implies  a  more  accurate  estimator. 


Accuracy  (a  =  0.01) 


Accuracy  (a  =  0.05) 


m 


Accuracy  (a  =  0.10) 


m 

Accuracy  (a  =  0.15) 


Figure  3.2:  A  comparison  of  p ,  computed  as  in  Equation  (15)  which  measures  the  accuracy  of 
the  approximated  Dantzig  selectors,  for  Algorithm  1  and  ADM  for  noise  levels  u  =  0.01,0.05,0.10 
and  0.15  in  Example  3.1.  In  each  plot,  the  points  along  the  curve  represent  the  mean  number  of 
iterations  required  for  each  parameter  m  over  100  simulations,  and  the  points  on  the  vertical  lines 
represent  one  standard  deviation  away  from  the  means. 


The  effects  of  Stage-II  of  Algorithm  1  and  the  postprocessing  step  of  ADM  are  illustrated  in 
Figure  3.1.  The  figure  displays  values  of  the  exact  simulated  vector  /3  and  of  the  Dantzig  selector 
/3  approximated  by  each  algorithm,  first  without  performing  postprocessing  (the  left  column  of 
Figure  3.1)  and  then  with  postprocessing  (the  right  column  of  Figure  3.1)  for  one  simulation 
with  parameters  ( n,p,s )  =  (720,2560,80)  and  noise  a  =  0.05.  One  can  clearly  see  that  the 
postprocessing  not  only  corrects  the  underestimated  magnitudes  of  nonzero  components  of  the 
estimates,  but  also  eliminates  unwanted  nonzero  components. 

The  results  of  the  above  simulations  suggest  that  Algorithm  1  has  less  overall  complexity  than 
ADM,  since  the  accuracy  of  the  Dantzig  selectors  approximated  by  each  method  are  similar  yet 
Algorithm  1  completes  much  faster  than  ADM,  even  when  requiring  more  iterations.  Figure  3.2 
displays  the  mean  and  standard  deviation  of  p  over  100  simulations  for  each  parameter  m  and  u 
and  for  both  Algorithm  1  and  ADM.  Note  that  the  accuracy  of  the  Dantzig  selector  approximated 
by  the  two  algorithms  are  very  similar  across  all  parameter  levels.  Figure  3.3  displays  the  mean 
and  standard  deviation  of  the  CPU  time,  and  Figure  3.4  displays  the  mean  and  standard  deviation 
of  the  total  number  of  iterations  performed  by  Algorithm  1  and  the  total  number  of  iterations 
performed  in  the  inner  loop  of  ADM  for  100  simulations  for  each  parameter  m  and  a.  From  the 


9 


CPU  Time  (a  =  0.01 )  CPU  Time  (a  =  0.05) 


CPU  Time  (a  =  0.10)  CPU  Time  (a  =  0.15) 


Figure  3.3:  A  comparison  of  the  CPU  time  required  to  recover  the  Dantzig  selector  using  Algo¬ 
rithm  1  and  ADM  for  noise  levels  a  =  0.01,  0.05, 0.10  and  0.15  as  in  Example  3.1.  In  each  plot,  the 
points  along  the  curve  represent  the  mean  number  of  iterations  required  for  each  parameter  m  over 
100  simulations,  and  the  points  on  the  vertical  lines  represent  one  standard  deviation  away  from 
the  means. 


figures,  one  can  see  that  although  Algorithm  1  requires  more  iterations  than  ADM,  Algorithm  1 
completes  significantly  faster. 


Example  3.2  Leukemia  Data  Set 


In  this  experiment,  the  Dantzig  selectors  produced  by  Algorithm  1  and  by  ADM  are  used  with 
a  collection  of  biomarker  data  to  indicate  whether  a  patient  may  be  diagnosed  with  a  specific  type 
of  cancer.  The  biomarker  dataset,  first  introduced  in  [14]  and  studied  in  [27,  28],  contains  the 
measurements  of  7128  genes  related  to  leukemia  diagnoses.  The  dataset  is  split  into  a  training  set 
and  a  testing  set.  The  training  set  is  sampled  from  38  patients,  27  of  whom  were  diagnosed  with 
acute  lymphocytic  leukemia  (ALL)  and  11  with  acute  mylogenous  leukemia  (AML).  The  testing 
set  is  sampled  from  34  patients,  20  diagnosed  with  ALL  and  14  with  AML. 

Let  Atrain  G  ]g>38x7i28  contain  the  biomarker  data  in  the  training  set,  where  each  row  is  all  7128 
gene  measurements  of  a  single  patient  and  each  column  has  been  normalized  to  have  unit  1 2  norm. 
Let  i/train  G  M38  be  the  column  vector  indicating  the  diagnosis  of  each  patient  in  the  training  set: 


//train  ( j  ) 


0,  if  patient  j  in  the  training  set  is  diagnosed  with  ALL, 
1,  if  patient  j  in  the  training  set  is  diagnosed  with  AML. 
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m 

Iterations  (o  =  0.10) 


Iterations  (a  =  0.05) 


Iterations  (o  =  0.15) 


Figure  3.4:  A  comparison  of  the  number  of  iterations  required  to  recover  the  Dantzig  selector  using 
Algorithm  1  and  ADM  for  noise  levels  a  =  0.01,0.05,0.10  and  0.15  in  Example  3.1.  In  each  plot, 
the  points  along  the  curve  represent  the  mean  number  of  iterations  required  for  each  parameter  m 
over  100  simulations,  and  the  points  on  the  vertical  lines  represent  one  standard  deviation  away 
from  the  means. 


Similarly  define  Atest  £  K34x7128  and  //test  £  M34  from  the  data  in  the  testing  set. 

This  experiment  has  a  training  phase  and  a  testing  phase.  In  the  training  phase,  a  sparse  vector 
(5  is  found  such  that  Xtr»™d  =  Strain-  To  preprocess  the  data,  only  the  biomarkers  with  the  largest 
variance  are  used  to  train  the  parameter  /3.  To  this  end,  select  a  positive  integer  N,  and  let  A  be 
the  N  indices  of  columns  from  Atrain  with  largest  variance.  Let  Xtrain  £  M38xAr  be  the  submatrix 
of  Xtrain  with  columns  in  A.  Form  the  reduced  problem 


Pa  £  argminfleKN  j  | 


XJ 


train 


-^-train/5  2/train 


<  s 


(16) 


The  Dantzig  selector  f3\  £  satisfying  problem  (16)  is  computed  using  Algorithm  1  and  ADM, 
then  extended  to  form  f3  £  M7128  via 


I P(MJ))  =  Pa{j),  for  j  =  1  :  N, 

3(k )  =  0,  if  k  ^  A. 

In  the  testing  phase,  the  trained  parameter  /3  is  used  to  predict  the  diagnoses  of  patients  in  the 
testing  set.  The  predictive  indicator  vector  ytest  £  K34  is  computed  from  y  =  XtestP  by  thresholding 
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<  0.49, 

<  y(j)- 

0.51  <  y(j)}-  For  values  of  j  such  that 

<  I y(j)  -  yi\, 

<  I y(j)  -  yo\- 

The  jth  patient  in  the  testing  set  is  predicted  to  have  a  diagnosis  of  ALL  if  2/testO)  =  0  and  a 
diagnosis  of  AML  if  ytest(j)  =  L 

The  above  procedure  was  used  to  predict  the  diagnoses  of  patients  in  the  testing  set  using 
the  Dantzig  selector  (3\  computed  using  both  Algorithm  1  and  ADM  with  parameters  N  =  1000, 
a  =  || -^train-strain Hi  and  tol  =  0.1  and  stopping  criteria  parameters  r/  =  80  and  e  =  10-4  for  each  6 
in  {0.0625,  0.125,  0.1875,  0.25,  0.3125,  0.375}.  Figure  3.5  displays  the  results  of  these  simulations 
regarding  the  accuracy  of  the  recovered  indicator  vector  ytes t  in  predicting  the  leukemia  diagnoses  of 
patients  in  the  testing  set,  as  well  as  the  number  of  iterations  and  CPU  runtime  used  by  Algorithm  1 
and  ADM.  As  shown  in  Figure  3.5(a),  Algorithm  1  typically  predicted  the  diagnoses  of  patients  with 
higher  acuracy  than  ADM.  Moreover,  for  each  parameter  d,  Algorithm  1  used  fewer  iterations  than 
ADM  and  the  time  used  by  Algorithm  1  was  several  orders  of  magnitude  less  than  the  time  used 
by  ADM,  as  shown  in  Figures  3.5(c)  and  (d).  Figure  3.5(b)  illustrates  the  tendency  of  Algorithm  1 
to  predict  the  diagnosis  of  patients  in  the  testing  set  with  higher  accuracy  than  ADM.  This  plot 
displays  the  values  of  y  =  Xtest/3  recovered  using  Algorithm  1  and  by  ADM  prior  to  the  thresholding 
step,  along  with  the  true  values  of  j/test-  Since  the  values  recovered  by  Algorithm  1  tend  to  be  more 
spread  out,  it  is  easier  to  accurately  separate  them  into  two  distinct  clusters. 

4  Conclusion 

In  this  paper,  we  have  developed  an  iterative  algorithm  to  compute  the  Dantzig  selector,  the  solution 
to  the  minimization  problem  in  problem  (2).  The  algorithm  is  based  on  the  proximity  operator 
and  its  relationship  to  problem  (7).  The  two-stage  algorithm  we  proposed  is  an  improvement  over 
some  other  recently  proposed  methods  to  find  the  Dantzig  selector,  which  require  the  use  of  inner 
loop  to  estimate  parameters  within  each  step  of  the  algorithm.  Additionally,  our  proposed  method 
uses  a  novel  stopping  criterion  based  upon  the  support  of  the  approximated  parameters. 

We  compare  the  proposed  algorithm  to  the  alternating  direction  method  proposed  in  [18]. 
Theoretically,  two  methods  produce  results  of  similar  quality,  however  each  iteration  of  Stage-I 
of  Algorithm  1  has  less  computational  complexity  than  each  iteration  of  the  inner  loop  of  the 
alternating  direction  method.  The  numerical  experiments  demonstrate  that  the  proposed  method 
and  the  alternating  direction  method  typically  approximate  the  Dantzig  selectors  with  similar 
accuracy,  yet  Algorithm  1  produces  results  in  significantly  less  time,  whether  it  uses  more  iterations 
than  the  alternating  direction  method,  as  in  Experiment  3.1,  or  fewer  iterations  than  the  alternating 
direction  method,  as  in  Experiment  3.2. 


and  clustering  values  near  the  threshold  boundary.  Set 

2/test  0)  = 


0,  if  y(j) 
1,  if  0.51 


Let  y0  =  ma x{y(j)  :  y(j)  <  0.49}  and  y\  =  min {y(J)  : 
0.49  <  y(j)  <  0.51,  set 

0,  if  |y(j)  -  yol 


2/test  ( j  )  — 


1. 


if  1 2/0  )  -  2/1 1 
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Number  of  Misdiagnosed  Patients 


Predicted  Indicator 


(a)  The  number  of  patients  in  the  testing  set 
misdiagnosed  by  the  predicted  indicator  vec¬ 
tor  recovered  by  Algorithm  1  and  ADM  for 
various  values  of  the  parameter  S. 


Number  of  Iterations 


(b)  Values  of  the  actual  diagnosis  indicator 
vector  j/test  along  with  values  of  the  predicted 
indicator  vectors  recovered  by  Algorithm  1 
and  ADM  prior  to  separating  values  into  clas¬ 
sification  groups  for  S  =  0.25. 


0.0625  0.125  0.1875  0.25  0.3125  0.375 

6 


(c)  The  number  of  iterations  required  by  Al¬ 
gorithm  1  and  ADM  to  recover  the  Dantzig 
selector  /3a  for  various  values  of  the  parame¬ 
ter  6. 


(d)  The  CPU  runtime  required  by  Algo¬ 
rithm  1  and  ADM  to  recover  the  Dantzig  se¬ 
lector  /3a  for  various  values  of  the  parameter 
<5. 


Figure  3.5:  Plots  regarding  the  indicator  vector,  used  to  predict  a  leukemia  diagnosis  in  patients 
in  the  testing  set  as  in  Example  3.2,  using  both  Algorithm  1  and  ADM. 
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the  Dantzig  selector  using  the  alternating  direction  method  and  for  sharing  the  real  dataset  used 
in  Example  3.2. 
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